Abstract

Pre-training general-purpose visual features with convolutional neural
networks without relying on annotations is a challenging and important task.
Most recent efforts in unsupervised feature learning have focused on either
small or highly curated datasets like ImageNet, whereas using non-curated raw
datasets was found to decrease the feature quality when evaluated on a
transfer task. Our goal is to bridge the performance gap between unsupervised
methods trained on curated data, which are costly to obtain, and massive raw
datasets that are easily available. To that effect, we propose a new
unsupervised approach which leverages self-supervision and clustering to
capture complementary statistics from large-scale data. We validate our
approach on 96 million images from YFCC100M [44], achieving state-of-the-art
results among unsupervised methods on standard benchmarks, which confirms the
potential of unsupervised learning when only non-curated raw data are
available. We also show that pre-training a supervised VGG-16 with our method
achieves 74.9% top-1 classification accuracy on the validation set of
ImageNet, which is an improvement of +0.8% over the same network trained from
scratch. Our code is available at
https://github.com/facebookresearch/DeeperCluster.


1. Introduction 

Pre-trained convolutional neural networks, or convnets, are important
components of image recognition applications [7, 8, 40, 49]. They improve the
generalization of models trained on a limited amount of data [41] and speed
up training in applications where annotated data is abundant [2' ]. Convnets
produce good generic representations when they are pre-trained on large
supervised datasets like ImageNet [11]. However, designing such
fully-annotated datasets has required a significant effort from the research
community in terms of data cleansing and manual labeling.



Figure 1: Influence of the amount of data (left) and the number of clusters
(right) on feature quality. We report validation mAP on the Pascal VOC
classification task (fc68 setting).

Scaling up the annotation process to datasets that are orders of magnitude
bigger raises important difficulties. Using raw metadata as an alternative
has been shown to perform comparatively well [23, 43], even surpassing
ImageNet pre-training when trained on billions of images [30]. However,
metadata are not always available, and when they are, they do not necessarily
cover the full extent of a dataset. These difficulties motivate the design of
methods that learn transferable features without using any annotation.

Recent works describing unsupervised approaches have reported performances
that are closing the gap with their supervised counterparts [6, 15, f ].
However, the best performing unsupervised methods are trained on ImageNet, a
curated dataset made of carefully selected images to form well-balanced and
diversified classes [11]. Simply discarding the labels does not undo this
careful selection, as it only removes part of the human supervision. Because
of that, previous works that have experimented with non-curated raw data
report a degradation of the quality of features [6, 12]. In this work, we aim
at learning good visual representations from unlabeled and non-curated
datasets. We focus on the YFCC100M dataset [44], which contains 99 million
images from the Flickr photo-sharing website. This dataset is unbalanced,
with a “long-tail” distribution of hashtags contrasting with the well-behaved
label distribution of ImageNet (see Appendix). For example, guenon and
baseball correspond to labels with 1300 associated images in ImageNet, while
there are respectively 226 and 256,758 images associated with these hashtags
in YFCC100M. Our goal is to understand if trading manually-curated data for
scale leads to an improvement in the feature quality.

We propose a new unsupervised approach specifically designed to leverage
large amounts of raw data. Indeed, training on large-scale non-curated data
requires (i) model complexity to increase with dataset size; (ii) model
stability to data distribution changes. A simple yet effective solution is to
combine methods from two domains of unsupervised learning: clustering and
self-supervision. Since clustering methods, like DeepCluster [6], build
supervision from inter-image similarities, the task at hand becomes
inherently more complex when the number of images increases. In addition,
DeepCluster captures finer relations between images when the number of
clusters scales with the dataset size. Clustering approaches infer target
labels at the same time as features are learned. Thus, target labels evolve
during training, making clustering-based approaches unstable. Furthermore,
these methods are sensitive to data distribution as they rely directly on
cluster structure in the underlying data. Explicitly dealing with unbalanced
category distribution might be a solution but it assumes that we know the
distribution of the latent classes. We design our method without this
assumption. On the other hand, self-supervised learning [10] consists in
designing a pretext task by predicting pseudo-labels automatically extracted
from input signals [12]. In other words, self-supervised approaches, like
RotNet [15], leverage intra-image statistics to build supervision, which are
often independent of the data distribution. However, the dataset size has
little impact on the nature of the task and on the performance of the
resulting features (see Figure 1). Leveraging larger datasets then requires
manually increasing the difficulty of the self-supervision task [19]. Our
approach automatically increases complexity through the clustering strategy.

Table 1: Training on non-curated large-scale data requires model complexity
to increase with dataset size and model stability to data distribution
changes. A simple solution is to combine self-supervision and clustering.

The novelty of our method lies in the combination of these two paradigms
(Table 1) so that they benefit from one another. Our approach, DeeperCluster,
automatically generates targets by clustering the features of the entire
dataset, under constraints derived from self-supervision. Due to the
“long-tail” distribution of raw non-curated data, processing huge datasets
and learning a large number of targets is necessary, making the problem
challenging from a computational point of view. For this reason, we propose a
hierarchical formulation that is suitable for distributed training. This
enables the discovery of latent categories present in the “tail” of the image
distribution. While our framework is general, in practice we focus on
combining the large-rotation classification task of Gidaris et al. [15] with
the clustering approach of Caron et al. [6]. Figure 1 (left) shows that as we
increase the number of training images, the quality of features improves to
the point where it surpasses that of features trained without labels on
curated datasets. More importantly, we evaluate the quality of our approach
as a pre-training step for ImageNet classification. Pre-training a supervised
VGG-16 with our unsupervised approach leads to a top-1 accuracy of 74.9%,
which is an improvement of +0.8% over a model trained from scratch. This
shows the potential of unsupervised pre-training on large non-curated
datasets as a way to improve the quality of visual features.

2. Related Work 

Self-supervision. Self-supervised learning builds a pretext task from the
input signal to train a model without annotation [10]. Many pretext tasks
have been proposed [22, 31, 47, 5 ], exploiting, amongst others, spatial
context [12, 24, 33, 34, 36], cross-channel prediction [27, 28, 55, 56], or
the temporal structure of videos [1, 35, 46]. Some pretext tasks explicitly
encourage the representations to be either invariant or discriminative to
particular types of input transformations. For example, Dosovitskiy et
al. [13] consider each image and its transformations as a class to enforce
invariance to data transformations. In this paper, we build upon the work of
Gidaris et al. [15] where the model encourages features to be discriminative
for large rotations. Recently, Kolesnikov et al. [2 ] have conducted an
extensive benchmark of self-supervised learning methods on different convnet
architectures. As opposed to our work, they use curated datasets for
pre-training.

Deep clustering. Clustering, along with density estimation and
dimensionality reduction, is a family of standard unsupervised learning
methods. Various attempts have been made to train convnets using
clustering [2, 3, 6, 29, 48, 52, 53]. Our paper builds upon the work of Caron
et al. [6], in which k-means is used to cluster the visual representations.
Unlike our work, they mainly focus on training their approach using ImageNet
without labels. Recently, Noroozi et al. [34] showed that clustering can also
be used as a form of distillation to improve the performance of networks
trained with self-supervision. As opposed to our work, they use clustering
only as a post-processing step and do not leverage the complementarity
between clustering and self-supervision to further improve the quality of
features.

Learning on non-curated datasets. Some methods [9, 17, 3 ] aim at learning
visual features from non-curated data streams. They typically use metadata
such as hashtags [23, 43] or geolocalization [50] as a source of noisy
supervision. In particular, Mahajan et al. [30] train a network to classify
billions of Instagram images into predefined and clean sets of hashtags. They
show that with little human effort, it is possible to learn features that
transfer well to ImageNet, even achieving state-of-the-art performance if
fine-tuned. As opposed to our work, they use an extrinsic source of
supervision that had to be cleaned beforehand.

3. Preliminaries 

In this work, we refer to the vector obtained at the penultimate layer of the
convnet as a feature or representation. We denote by $f_\theta$ the
feature-extracting function, parametrized by a set of parameters $\theta$.
Given a set of images, our goal is then to learn a “good” mapping
$f_{\theta^*}$. By “good”, we mean a function that produces general-purpose
visual features that are useful on downstream tasks.

3.1. Self-supervision 

In self-supervised learning, a pretext task is used to extract target labels
directly from data [12]. These targets can take a variety of forms. They can
be categorical labels associated with a multiclass problem, as when
predicting the transformation of an image [15, 54] or the ordering of a set
of patches [3 ]. Or they can be continuous variables associated with a
regression problem, as when predicting image color [55] or surrounding
patches [36]. In this work, we are interested in the former. We suppose that
we are given a set of $N$ images $\{x_1, \dots, x_N\}$ and we assign a
pseudo-label $y_n$ in $\mathcal{Y}$ to each input $x_n$. Given these
pseudo-labels, we learn the parameters $\theta$ of the convnet jointly with a
linear classifier $V$ to predict pseudo-labels by solving the problem

$$\min_{\theta, V} \frac{1}{N} \sum_{n=1}^{N} \ell\big(y_n, V f_\theta(x_n)\big), \tag{1}$$

where $\ell$ is a loss function. The pseudo-labels $y_n$ are fixed during the
optimization and the quality of the learned features entirely depends on
their relevance.
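As an illustration of the objective in Eq. (1), the following minimal NumPy sketch evaluates the average loss of a linear classifier on fixed pseudo-labels. It assumes that the loss is the negative log-softmax (the choice stated later for Eq. (5)) and that features are already extracted; the function names and the toy data are hypothetical and not taken from the released code.

```python
import numpy as np

def log_softmax(scores):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def pseudo_label_objective(features, pseudo_labels, V):
    """Average negative log-softmax loss, i.e. Eq. (1) for a fixed theta.

    features:      (N, d) array standing in for f_theta(x_n)
    pseudo_labels: (N,) integer array of fixed targets y_n
    V:             (num_classes, d) weights of the linear classifier
    """
    scores = features @ V.T                       # (N, num_classes)
    log_probs = log_softmax(scores)
    picked = log_probs[np.arange(len(pseudo_labels)), pseudo_labels]
    return float(-picked.mean())

# Toy data: 8 random "features" with 4 possible pseudo-labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 5))
pseudo_labels = rng.integers(0, 4, size=8)

# With a zero classifier every class is equally likely, so the loss is log(4).
loss_at_zero = pseudo_label_objective(features, pseudo_labels, np.zeros((4, 5)))
```

In actual training, both the classifier and the convnet parameters would be updated by stochastic gradient descent on this quantity; the sketch only evaluates it.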


Rotation as self-supervision. Gidaris et al. [15] have recently shown that
good features can be obtained when training a convnet to discriminate between
different image rotations. In this work, we focus on their pretext task,
RotNet, since its performance on standard evaluation benchmarks is among the
best in self-supervised learning. This pretext task corresponds to a
multiclass classification problem with four categories: rotations in
{0°, 90°, 180°, 270°}. Each input $x_n$ in Eq. (1) is randomly rotated and
associated with a target $y_n$ that represents the angle of the applied
rotation.

3.2. Deep clustering

Clustering-based approaches for deep networks typically build target classes
by clustering visual features produced by convnets. As a consequence, the
targets are updated during training along with the representations and are
potentially different at each epoch. In this context, we define a latent
pseudo-label $z_n$ in $\mathcal{Z}$ for each image $n$ as well as a
corresponding linear classifier $W$. These clustering-based methods alternate
between learning the parameters $\theta$ and $W$ and updating the
pseudo-labels $z_n$. Between two reassignments, the pseudo-labels $z_n$ are
fixed, and the parameters and classifier are optimized by solving

$$\min_{\theta, W} \frac{1}{N} \sum_{n=1}^{N} \ell\big(z_n, W f_\theta(x_n)\big), \tag{2}$$

which is of the same form as Eq. (1). Then, the pseudo-labels $z_n$ can be
reassigned by minimizing an auxiliary loss function. This loss sometimes
coincides with Eq. (2) [3, 52] but some works proposed to use another
objective [6, 5 ].

Updating the targets with k-means. In this work, we focus on the framework
of Caron et al. [6], DeepCluster, where latent targets are obtained by
clustering the activations with k-means. More precisely, the targets $z_n$
are updated by solving the following optimization problem:

$$\min_{C \in \mathbb{R}^{d \times k}} \sum_{n=1}^{N} \left[ \min_{z_n \in \{0,1\}^k \ \text{s.t.}\ z_n^\top \mathbf{1} = 1} \big\| C z_n - f_\theta(x_n) \big\|_2^2 \right], \tag{3}$$

where $C$ is the matrix where each column corresponds to a centroid, $k$ is
the number of centroids, and $z_n$ is a binary vector with a single non-zero
entry. This approach assumes that the number of clusters $k$ is known a
priori; in practice, we set it by validation on a downstream task (see
Sec. 5.3). The latent targets are updated every $T$ epochs of stochastic
gradient descent steps when minimizing the objective (2).

Note that this alternate optimization scheme is prone to trivial solutions
and controlling the way optimization procedures of both objectives interact
is crucial. Re-assigning empty clusters and performing a batch-sampling based
on a uniform distribution over the cluster assignments are workarounds to
avoid trivial parametrization [6].


4. Method 

In this section, we describe how we combine self-supervised learning with
deep clustering in order to scale up to large numbers of images and targets.

4.1. Combining self-supervision and clustering 

We assume that the inputs $x_1, \dots, x_N$ are rotated images, each
associated with a target label $y_n$ encoding its rotation angle and a
cluster assignment $z_n$. The cluster assignment changes during training
along with the visual representations. We denote by $\mathcal{Y}$ the set of
possible rotation angles and by $\mathcal{Z}$ the set of possible cluster
assignments. A way of combining self-supervision with deep clustering is to
add the losses defined in Eq. (1) and Eq. (2). However, summing these losses
implicitly assumes that classifying rotations and cluster memberships are two
independent tasks, which may limit the signal that can be captured. Instead,
we work with the Cartesian product space $\mathcal{Y} \times \mathcal{Z}$,
which can potentially capture richer interactions between the two tasks. We
get the following optimization problem:

$$\min_{\theta, W} \frac{1}{N} \sum_{n=1}^{N} \ell\big(y_n \otimes z_n, W f_\theta(x_n)\big). \tag{4}$$

Note that any clustering or self-supervised approach with a multiclass
objective can be combined with this formulation. For example, we could use a
self-supervision task that captures information about tiles
permutations [33] or frame ordering in a video [4 ]. However, this
formulation does not scale in the number of combined targets, i.e., its
complexity is $O(|\mathcal{Y}||\mathcal{Z}|)$. This limits the use of a large
number of clusters or a self-supervised task with a large output space [54].
In particular, if we want to capture information contained in the tail of the
distribution of non-curated datasets, we may need a large number of clusters.
We thus propose an approximation of our formulation based on a scalable
hierarchical loss that is designed to suit distributed training.
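To make the product-space target of Eq. (4) concrete, here is a small sketch that flattens a (rotation, cluster) pair into a single class index and counts the outputs a single classifier W would then need. The row-major indexing, the helper names, and the toy number of clusters are illustrative assumptions; the feature dimension 4096 is the one quoted for the clustered features in Sec. 4.2.

```python
import numpy as np

def joint_target(rotation_label, cluster_label, num_clusters):
    """Index of the pair (y_n, z_n) in the product space Y x Z, flattened row-major."""
    return rotation_label * num_clusters + cluster_label

def split_target(joint_label, num_clusters):
    """Inverse of joint_target: recover (rotation, cluster) from the flat index."""
    return joint_label // num_clusters, joint_label % num_clusters

num_rotations, num_clusters = 4, 1000   # toy number of clusters
rotations = np.array([0, 1, 2, 3, 1])   # index of the applied rotation
clusters = np.array([5, 5, 999, 0, 42]) # cluster assignment of each image
joint = joint_target(rotations, clusters, num_clusters)
recovered_rotations, recovered_clusters = split_target(joint, num_clusters)

# A single classifier W needs one output per pair, i.e. |Y| * |Z| rows.
num_joint_classes = num_rotations * num_clusters
feature_dim = 4096
classifier_weights = num_joint_classes * feature_dim
```

The number of classifier weights grows linearly with the number of clusters, multiplied by the number of rotations, which is the scaling issue that motivates the hierarchical loss.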

4.2. Scaling up to a large number of targets

Figure 2: DeeperCluster alternates between a hierarchical clustering of the
features and learning the parameters of a convnet by predicting both the
rotation angle and the cluster assignments in a single hierarchical loss.

Hierarchical losses are commonly used in language modeling where the goal is
to predict a word out of a large vocabulary [5]. Instead of making one
decision over the full vocabulary, these approaches split the process in a
hierarchy of decisions, each with a smaller output space. For example, the
vocabulary can be split into clusters of semantically similar words, and the
hierarchical process would first select a cluster and then a word within this
cluster.

Following this line of work, we partition the target labels into a 2-level
hierarchy where we first predict a super-class and then a sub-class among its
associated target labels. The first level is a partition of the images into
$S$ super-classes and we denote by $y_n$ the super-class assignment vector in
$\{0,1\}^S$ of the image $n$ and by $y_{ns}$ the $s$-th entry of $y_n$. This
super-class assignment is made with a linear classifier $V$ on top of the
features. The second level of the hierarchy is obtained by partitioning
within each super-class. We denote by $z_n^s$ the vector in $\{0,1\}^{k_s}$
of the assignment into $k_s$ sub-classes for an image $n$ belonging to
super-class $s$. There are $S$ sub-class classifiers $W_1, \dots, W_S$, each
predicting the sub-class memberships within a super-class $s$. The parameters
of the linear classifiers $(V, W_1, \dots, W_S)$ and $\theta$ are jointly
learned by minimizing the following loss function:

$$\frac{1}{N} \sum_{n=1}^{N} \left[ \ell\big(V f_\theta(x_n), y_n\big) + \sum_{s=1}^{S} y_{ns}\, \ell\big(W_s f_\theta(x_n), z_n^s\big) \right], \tag{5}$$

where $\ell$ is the negative log-softmax function. Note that an image that
does not belong to the super-class $s$ does not belong either to any of its
$k_s$ sub-classes.


Choice of super-classes. A natural partition would be to define the
super-classes based on the target labels from the self-supervised task and
the sub-classes as the labels produced by clustering. However, this would
mean that each image of the entire dataset would be present in each
super-class (with a different rotation), which does not take advantage of the
hierarchical structure to use a larger number of clusters.

Instead, we split the dataset into m sets by running k-means with m centroids
on the full dataset every T epochs. We then use the Cartesian product between
the assignment to these m clusters and the angle rotation classes to form the
super-classes. There are 4m super-classes, each associated with the subset of
data belonging to the corresponding cluster (N/m images if the clustering is
perfectly balanced). These subsets are then further split with k-means into k
sub-classes. This is equivalent to running a hierarchical k-means with
rotation constraints on the full dataset to form our hierarchical loss. We
typically use m = 4 and k = 80k, leading to a total of 320k different
clusters split in 4 subsets. Our approach, “DeeperCluster”, shares
similarities with DeepCluster but is designed to scale to larger datasets. We
alternate between clustering the features of the non-rotated images and
training the network to predict both the rotation applied to the input data
and its cluster assignment amongst the clusters corresponding to this
rotation (Figure 2).
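The construction of super-classes and the loss of Eq. (5) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: every super-class has the same number k of sub-classes, the super-class index is a row-major flattening of the (coarse cluster, rotation) pair, and all names and toy sizes are hypothetical.

```python
import numpy as np

def log_softmax(scores):
    """Log-softmax over the last axis, shifted by the maximum for stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def super_class_index(coarse_cluster, rotation, num_rotations=4):
    """Flatten (coarse cluster in 0..m-1, rotation in 0..3) into an index in 0..4m-1."""
    return coarse_cluster * num_rotations + rotation

def hierarchical_loss(features, super_labels, sub_labels, V, W):
    """Super-class loss plus sub-class loss inside the true super-class, as in Eq. (5).

    features:     (N, d) features f_theta(x_n)
    super_labels: (N,) index s of the super-class of each image
    sub_labels:   (N,) index of the sub-class within that super-class
    V:            (S, d) super-class classifier
    W:            (S, k, d) one sub-class classifier W_s per super-class
    """
    n = np.arange(len(features))
    loss_super = -log_softmax(features @ V.T)[n, super_labels]
    # Only the classifier of the true super-class contributes (y_ns = 1).
    sub_scores = np.einsum("nkd,nd->nk", W[super_labels], features)
    loss_sub = -log_softmax(sub_scores)[n, sub_labels]
    return float((loss_super + loss_sub).mean())

m, k, d = 2, 3, 5                      # toy sizes; the text uses m = 4 and k = 80k
S = 4 * m                              # number of super-classes
rng = np.random.default_rng(0)
features = rng.normal(size=(6, d))
coarse = rng.integers(0, m, size=6)    # assignment to one of the m coarse clusters
rotation = rng.integers(0, 4, size=6)  # index of the applied rotation
super_labels = super_class_index(coarse, rotation)
sub_labels = rng.integers(0, k, size=6)

# With zero classifiers both decisions are uniform, so the loss is log(S) + log(k).
loss_at_zero = hierarchical_loss(features, super_labels, sub_labels,
                                 np.zeros((S, d)), np.zeros((S, k, d)))
```

Each image only evaluates S + k classifier outputs instead of the S * k outputs a flat classifier over all targets would require.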

Distributed training. Building the super-classes based on data splits lends
itself to a distributed implementation that scales well in the number of
images. Specifically, when optimizing Eq. (5), we form as many distributed
communication groups of p GPUs as the number of super-classes, i.e.,
G = 4m. Different communication groups share the parameters $\theta$ and the
super-class classifier $V$, while the parameters of the sub-class classifiers
$W_1, \dots, W_S$ are only shared within a communication group. Each
communication group $s$ deals only with the subset of images and the rotation
angle associated with the super-class $s$.

Distributed k-means. Every T epochs, we recompute the super and sub-class
assignments by running two consecutive k-means on the entire dataset. This is
achieved by first randomly splitting the dataset across different GPUs. Each
GPU is in charge of computing cluster assignments for its partition, whereas
centroids are updated across GPUs. We reduce communication between GPUs by
sharing only the number of assigned elements for each cluster and the sum of
their features. The new centroids are then computed from these statistics. We
observe empirically that k-means converges in 10 iterations. We cluster 96M
features of dimension 4096 into m = 4 clusters using 64 GPUs (1 minute per
iteration). Then, we split this pool of GPUs into 4 groups of 16 GPUs. Each
group clusters around 23M features into 80k clusters (4 minutes per
iteration).
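The communication pattern described above, where workers exchange only per-cluster counts and feature sums, can be simulated on a single machine. The sketch below is a hedged illustration: the shards stand in for GPUs, a Python sum stands in for the all-reduce, and keeping the previous centroid for an empty cluster is a simplification chosen here, not the re-assignment strategy of [6].

```python
import numpy as np

def local_statistics(shard, centroids):
    """Work of one worker: assign its shard, return per-cluster counts and feature sums."""
    # Squared distances between every feature of the shard and every centroid.
    distances = ((shard[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = distances.argmin(axis=1)
    k, d = centroids.shape
    counts = np.bincount(assignments, minlength=k)
    sums = np.zeros((k, d))
    np.add.at(sums, assignments, shard)      # accumulate features per cluster
    return counts, sums

def distributed_kmeans_step(shards, centroids):
    """One k-means iteration where workers exchange only counts and sums."""
    stats = [local_statistics(shard, centroids) for shard in shards]
    counts = sum(c for c, _ in stats)        # stands in for an all-reduce
    sums = sum(s for _, s in stats)
    new_centroids = centroids.copy()         # empty clusters keep their old centroid
    nonempty = counts > 0
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centroids

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
centroids = features[:4].copy()
shards = np.array_split(features, 8)         # 8 simulated workers
sharded = distributed_kmeans_step(shards, centroids)
single = distributed_kmeans_step([features], centroids)
```

Because counts and sums are additive across shards, the sharded update matches the single-worker update up to floating-point rounding, while each worker communicates only k counts and a k x d matrix per iteration.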

4.3. Implementation details 

Method                  Data      Classif.  Classif.  Detect.  Detect.
                                  fc68      all       fc68     all
Larsson et al. [28]     INet+Pl.  -         77.2†     49.2     59.7
Wu et al. [51]          INet      -         -         -        60.5†
Doersch et al. [12]     INet      54.6      78.5      38.0     62.7
Caron et al. [6]        INet      78.5      82.5      58.7     65.9†
Unsupervised on non-curated data
Mahendran et al. [31]   YFCCv     -         76.4†     -        -
Wang and Gupta [46]     YT8M      -         -         -        60.2†
Wang et al. [47]        YT9M      59.4      79.6      40.9     63.2†
DeeperCluster           YFCC      79.7      84.3      60.5     67.8

Table 2: Comparison of DeeperCluster to state-of-the-art unsupervised feature
learning on classification and detection on Pascal VOC 2007. We separate
methods using curated datasets from methods using non-curated datasets. We
selected hyper-parameters for each transfer task on the validation set, and
then retrained on both training and validation sets. We report results on the
test set averaged over 5 runs. “YFCCv” stands for the videos contained in the
YFCC100M dataset. †: numbers from their original paper.

The loss in Eq. (5) is minimized with mini-batch stochastic gradient
descent [4]. Each mini-batch contains 3072 instances distributed across 64
GPUs, leading to 48 instances per GPU per mini-batch [1 ]. We use dropout,
weight decay, momentum and a constant learning rate of 0.1. We reassign
clusters every 3 epochs. We use the Pascal VOC 2007 classification task
without finetuning as a downstream task to select hyper-parameters. In order
to speed up experimentation, we initialize the network with RotNet trained on
YFCC100M. Before clustering, we perform a whitening of the activations and
ℓ2-normalize each of them. We use standard data augmentations, i.e., cropping
of random sizes and aspect ratios and horizontal flips [26]. We use the
VGG-16 architecture [42] with batch normalization layers. Following
[3, 6, 37], we pre-process images with a Sobel filtering. We train our models
on the 96M images from YFCC100M [44] that we managed to download. We use
this publicly available dataset for research purposes only.
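The pre-processing applied to the activations before clustering (whitening followed by ℓ2-normalization) can be sketched as below. The text does not specify the whitening variant; PCA whitening without dimensionality reduction, the small constant added to the eigenvalues, and the function name are assumptions made for this illustration.

```python
import numpy as np

def whiten_and_normalize(features, eps=1e-5):
    """PCA-whiten the features, then scale every row to unit l2 norm."""
    centered = features - features.mean(axis=0)
    covariance = centered.T @ centered / len(features)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    # Rotate onto the principal axes and rescale each axis to unit variance.
    whitened = centered @ eigenvectors / np.sqrt(eigenvalues + eps)
    norms = np.linalg.norm(whitened, axis=1, keepdims=True)
    return whitened / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
mixing = rng.normal(size=(6, 6))
features = rng.normal(size=(500, 6)) @ mixing   # correlated toy activations
processed = whiten_and_normalize(features)
```

After this step every feature lies on the unit sphere, so the squared Euclidean distance used by k-means is a monotone function of cosine similarity.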

5. Experiments 

In this section we evaluate the quality of the features 
learned with DeeperCluster on a variety of downstream 
tasks, such as classification or object detection. We also 
provide insights about the impact of the number of images 
and clusters on the performance of our model. 

5.1. Evaluating unsupervised features 

We evaluate the quality of the features extracted from a convnet trained with
DeeperCluster on YFCC100M by considering several standard transfer learning
tasks, namely image classification, object detection and scene
classification.

Pascal VOC 2007 [ 1]. This dataset has small training and validation sets
(2.5k images each), making it close to the setting of real applications where
models trained using large computational resources are adapted to a new task
with a small number of instances. We report numbers on the classification and
detection tasks with finetuning (“ALL”) or by only retraining the last three
fully connected layers of the network (“fc68”). The fc68 setting gives a
better measure of the quality of the evaluated features since fewer
parameters are retrained. For classification, we use the code of Caron et
al. [6]¹ and for detection, fast-rcnn [16]². For classification, we train the
models for 150k iterations, starting with a learning rate of 0.002 decayed by
a factor of 10 every 20k iterations, and we report results averaged over 10
random crops. For object detection, we train our network for 150k iterations,
dividing the step-size by 10 after the first 50k steps, with an initial
learning rate of 0.01 (fc68) or 0.002 (ALL) and a weight decay of 0.0001.
Following Doersch et al. [12], we use the multiscale configuration, with
scales [400, 500, 600, 700] for training and [400, 500, 600] for testing. In
Table 2, we compare DeeperCluster with two sets of unsupervised methods that
use a VGG-16 network: those trained on curated datasets and those trained on
non-curated datasets. Previous unsupervised methods that worked on
non-curated datasets with a VGG-16 use videos: Youtube8M (“YT8M”), Youtube9M
(“YT9M”) or the videos from YFCC100M (“YFCCv”). Our approach achieves
state-of-the-art performance among all the unsupervised methods that use a
VGG-16 architecture, even those that use ImageNet as a training set. The gap
with a supervised network is still large when we freeze the convolutions (6%
for detection and 10% for classification) but drops to less than 5% for both
tasks with finetuning.

Linear classifiers on ImageNet [11] and Places205 [57]. ImageNet (“INet”)
and Places205 (“PL”) are two large-scale image classification datasets:
ImageNet’s domain covers objects and animals (1.3M images) and Places205’s
domain covers indoor and outdoor scenes (2.5M images). We train linear
classifiers with a logistic loss on top of frozen convolutional layers at
different depths. To reduce the influence of feature dimension in the
comparison, we average-pool the features until their dimension is below
10k [55]. This experiment probes the quality of the features extracted at
each convolutional layer. In Figure 3, we observe that DeeperCluster matches
the performance of a supervised network for all layers on Places205. On
ImageNet, it also matches supervised features up to the 4th convolutional
block; then the gap suddenly increases to around 20%. This is not surprising
since the supervised features are trained on ImageNet itself, while ours are
trained on YFCC100M.
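One way to average-pool a convolutional feature map until its dimension falls below 10k is sketched below. The exact pooling rule is not given in the text, so the choice made here (pool onto the largest square grid that keeps the flattened size under the budget) is an assumption, as are the function names and the 14 x 14 input size. For a 512-channel map this rule gives 512 x 4 x 4 = 8192 values, which agrees with the dimension quoted in Sec. 5.3 for the last layer.

```python
import math
import numpy as np

def largest_grid(channels, max_dim=10_000):
    """Largest g such that channels * g * g stays below max_dim (at least 1)."""
    return max(math.isqrt((max_dim - 1) // channels), 1)

def adaptive_average_pool(feature_map, g):
    """Average-pool a (C, H, W) map onto a (C, g, g) grid of roughly equal bins."""
    c, h, w = feature_map.shape
    pooled = np.empty((c, g, g))
    for i in range(g):
        r0, r1 = (i * h) // g, math.ceil((i + 1) * h / g)
        for j in range(g):
            c0, c1 = (j * w) // g, math.ceil((j + 1) * w / g)
            pooled[:, i, j] = feature_map[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return pooled

def pool_below(feature_map, max_dim=10_000):
    """Return the map unchanged if it is small enough, otherwise pool it down."""
    c, h, w = feature_map.shape
    if feature_map.size < max_dim:
        return feature_map
    g = min(largest_grid(c, max_dim), h, w)  # never upsample
    return adaptive_average_pool(feature_map, g)

conv_features = np.ones((512, 14, 14))       # hypothetical late-block activation
pooled = pool_below(conv_features)           # shape (512, 4, 4)
shallow = pool_below(np.ones((64, 224, 224)))  # shape (64, 12, 12)
```

The flattened pooled map is then the input of the linear classifier, so that layers with very different spatial resolutions are compared at a similar feature dimension.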

5.2. Pre-training for ImageNet 

Figure 3: Accuracy of linear classifiers on ImageNet and Places205 using the
activations from different layers as features. We compare a VGG-16 trained
with supervision on ImageNet to VGG-16 trained with either RotNet or
DeeperCluster on YFCC100M. Exact numbers are in Appendix.

In the previous section, we observed that a VGG-16 trained on YFCC100M has
similar or better low-level features than the same network trained on
ImageNet with supervision. In this experiment, we want to check whether these
low-level features pre-trained on YFCC100M without supervision can serve as a
good initialization for fully-supervised ImageNet classification. To this
end, we pre-train a VGG-16 on YFCC100M using either DeeperCluster or RotNet.
The resulting weights are then used as initialization for the training of a
network on ImageNet with supervision. We merge the Sobel weights of the
network pre-trained with DeeperCluster with the first convolutional layer
during the initialization. We then train the networks on ImageNet with
mini-batch SGD for 100 epochs, a learning rate of 0.1, a weight decay of
0.0001, a batch size of 256 and dropout of 0.5. We reduce the learning rate
by a factor of 0.2 every 20 epochs. Note that this learning rate decay
schedule slightly differs from the ImageNet classification PyTorch default
implementation³ where they train for 90 epochs and decay the learning rate by
0.1 at epochs 30 and 60. We give in Appendix the results with this default
schedule (with unchanged conclusions). In Table 3, we compare the performance
of a network trained with a standard initialization (“Supervised”) to one
initialized with a pre-training obtained from either DeeperCluster
(“Supervised + DeeperCluster pre-training”) or RotNet (“Supervised + RotNet
pre-training”) on YFCC100M. We see that our pre-training improves the
performance of a supervised network by +0.8%, leading to 74.9% top-1
accuracy. This means that our pre-training captures important statistics from
YFCC100M that transfer well to ImageNet.
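The two learning rate schedules mentioned above can be written as simple functions of the epoch. This sketch assumes 0-indexed epochs and reads “reduce by a factor of 0.2” as multiplying the learning rate by 0.2; the function names are hypothetical.

```python
def step_learning_rate(epoch, base_lr=0.1, factor=0.2, step=20):
    """Schedule used here: multiply the rate by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)

def milestone_learning_rate(epoch, base_lr=0.1, factor=0.1, milestones=(30, 60)):
    """Schedule described for the PyTorch example: decay at the listed epochs."""
    return base_lr * factor ** sum(epoch >= m for m in milestones)

ours = [step_learning_rate(e) for e in range(100)]          # 100 epochs
default = [milestone_learning_rate(e) for e in range(90)]   # 90 epochs
```

Under these assumptions the first schedule goes through five plateaus (0.1, 0.02, 0.004, 0.0008, 0.00016), whereas the default one has three (0.1, 0.01, 0.001).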

5.3. Model analysis 

In this final set of experiments, we analyze some components of our model.
Since DeeperCluster derives from RotNet and DeepCluster, we first look at the
difference between these methods and ours, when trained on curated and
non-curated datasets. We then report quantitative and qualitative evaluations
of the clusters obtained with DeeperCluster.

ImageNet                                   top-1   top-5
Supervised (PyTorch documentation⁴)        73.4    91.5
Supervised (our code)                      74.1    91.8
Supervised + RotNet pre-training           74.5    92.0
Supervised + DeeperCluster pre-training    74.9    92.3

Table 3: Accuracy on the validation set of ImageNet classification for a
supervised VGG-16 trained with different initializations: we compare a
network trained from a standard initialization to networks trained from
pre-trained weights using either DeeperCluster or RotNet on YFCC100M.

Method           Data        ImageNet   Places   VOC2007
Supervised       ImageNet    70.2       45.9     84.8
Wu et al. [51]   ImageNet    39.2       36.3     -
RotNet           ImageNet    32.7       32.6     60.9
DeepCluster      ImageNet    48.4       37.9     71.9
RotNet           YFCC100M    33.0       35.5     62.2
DeepCluster      YFCC100M    34.1       35.4     63.9
DeeperCluster    YFCC100M    45.6       42.1     73.0

Table 4: Comparison between DeeperCluster, RotNet and DeepCluster when
pre-trained on curated and non-curated datasets. We report the accuracy on
several datasets of a linear classifier trained on top of features of the
last convolutional layer. All the methods use the same architecture.
DeepCluster does not scale to the full YFCC100M dataset; we thus train it on
a random subset of 1.3M images.

¹ github.com/facebookresearch/deepcluster
² github.com/rbgirshick/py-faster-rcnn
³ github.com/pytorch/examples/blob/master/imagenet/
⁴ pytorch.org/docs/stable/torchvision/models

Comparison with RotNet and DeepCluster. In Table 4, we compare DeeperCluster
with DeepCluster and RotNet when a linear classifier is trained on top of the
last convolutional layer of a VGG-16 on several datasets. For reference, we
also report previously published numbers [51] with a VGG-16 architecture. We
average-pool the features of the last layer, resulting in representations of
8192 dimensions. Our approach outperforms both RotNet and DeepCluster, even
when they are trained on curated datasets (except for the ImageNet
classification task, where DeepCluster trained on ImageNet yields the best
performance). More interestingly, we see that the quality of the dataset or
its scale has little impact on RotNet while it has a strong impact on
DeepCluster. This confirms that self-supervised methods are more robust than
clustering to a change of dataset distribution.


Influence of dataset size and number of clusters. To measure the influence
of the number of images on features, we train models with 1M, 4M, 20M, and
96M images and report their accuracy on the validation set of the Pascal VOC
2007 classification task (fc68 setting). We also train models on 20M images
with a number of clusters that varies from 10k to 160k. For the experiment
with a total of 160k clusters, we choose m = 2, which results in 8
super-classes. In Figure 1, we observe that the quality of our features
improves when scaling both in terms of images and clusters. Interestingly,
between 4M and 20M YFCC100M images are needed to match the performance of our
method trained on ImageNet. Increasing the number of images has a bigger
impact than increasing the number of clusters. Yet, this improvement is
significant since it corresponds to a reduction of more than 10% of the
relative error w.r.t. the supervised model.

Quality of the clusters. In addition to features, our
method provides a clustering of the input images. We
evaluate the quality of these clusters by measuring their
correlation with existing partitions of the data. In particular,
YFCC100M comes with many kinds of metadata. We
consider hashtags, users, cameras and GPS coordinates. If an
image has several hashtags, we pick as label the least
frequent one in the total hashtag distribution. We also
measure the correlation of our clusters with labels predicted
by a classifier trained on ImageNet categories. We use a
ResNet-50 network [21], pre-trained on ImageNet, to
classify the YFCC100M images and we select those for which
the confidence in prediction is higher than 75%. This
evaluation omits a large amount of the data but gives some insight
into the quality of our clustering in object classification.
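
The label selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the alphabetical tie-breaking rule, and the dictionary layouts are hypothetical and are not taken from the released code. Only the "least frequent hashtag" rule and the 75% confidence threshold come from the text.

```python
from collections import Counter


def pick_rarest_hashtag(image_tags, tag_counts):
    """Return the hashtag of one image that is least frequent overall."""
    if not image_tags:
        return None
    # Ties are broken alphabetically (an assumption) so the result is deterministic.
    return min(image_tags, key=lambda tag: (tag_counts[tag], tag))


def label_images(tags_per_image):
    """Map each image id to its least frequent hashtag."""
    # Count in how many images each hashtag appears.
    tag_counts = Counter(
        tag for tags in tags_per_image.values() for tag in set(tags)
    )
    return {
        image_id: pick_rarest_hashtag(tags, tag_counts)
        for image_id, tags in tags_per_image.items()
    }


def keep_confident(predictions, threshold=0.75):
    """Keep predicted labels whose confidence is strictly above the threshold."""
    return {
        image_id: label
        for image_id, (label, confidence) in predictions.items()
        if confidence > threshold
    }
```

With this rule, an image tagged both cat and elephantparadelondon would be labeled elephantparadelondon, since that hashtag is far rarer in the overall distribution.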

In Figure 4, we show the evolution during training of the
normalized mutual information (NMI) between our clustering
and different metadata, and the predicted labels from
ImageNet. The higher the NMI, the more correlated our
clusters are to the considered partition. For reference, we
compute the NMI for a clustering of RotNet features (as it
corresponds to weights at initialization) and of a supervised
model. First, it is interesting to observe that our clustering
improves over time for every type of metadata. One
important factor is that most of these metadata are
correlated, since a given user takes pictures in specific places,
probably with a single camera, and uses a preferred fixed set
of hashtags. Yet, these plots show that our model captures in
the input signal enough information to predict these metadata
at least as well as the features trained with supervision.
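
For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of NMI between two labelings of the same images. The normalization by the geometric mean of the two entropies is an assumption: this section does not state which normalization is used, and other variants (for example the arithmetic mean) exist.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in count_ab.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y)))
        mi += (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
    return mi


def nmi(a, b):
    """NMI normalized by sqrt(H(a) * H(b)); this normalization is an assumption."""
    if len(a) != len(b):
        raise ValueError("both labelings must cover the same items")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # a constant labeling carries no information
    return mutual_information(a, b) / math.sqrt(h_a * h_b)
```

Here `a` would hold the cluster assignment of each image and `b` one type of metadata (for instance the user ID) for the same images.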

Figure 4: Normalized mutual information between our clustering and different sorts of metadata: hashtags, user IDs,
geographic coordinates, and device types. We also plot the NMI with an ImageNet classifier labeling.

Figure 5: We randomly select 9 images per cluster and indicate the dominant cluster metadata. The bottom row depicts
clusters pure for GPS coordinates but impure for user IDs. As expected, they turn out to correlate with tourist landmarks. No
metadata is used during training. For copyright reasons, we provide in the Appendix the photographer username for each image.

We visually assess the consistency of our clusters in Figure 5.
We display 9 random images from 8 manually picked
clusters. The first two clusters contain a majority of images
associated with tags from the head (first cluster) and from
the tail (second cluster) of the hashtag distribution in the
YFCC100M dataset. Indeed, 418,538 YFCC100M images are
associated with the tag cat whereas only 384 images contain
the tag elephantparadelondon (0.0004% of the dataset). We
also show a cluster for which the dominant hashtag does not
correlate visually with the content of the cluster. As already
mentioned, this database is non-curated and contains images
that basically do not depict anything semantic. The dominant
metadata of the last cluster in the top row is the device ID
CanoScan. As this cluster is about drawings, its images have
been mainly taken with a scanner. Finally, the bottom row
depicts clusters that are pure for GPS coordinates but impure
for user IDs. This results in clusters of images taken by many
different users in the same place: tourist landmarks.
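
One way to find such landmark-like clusters is to compare, per cluster, the purity with respect to GPS coordinates and with respect to user IDs. The sketch below is an illustration only: the purity definition (share of the most common value), the two thresholds, and the `gps_cell` and `user` field names are assumptions, not the procedure used to pick the clusters shown in Figure 5.

```python
from collections import Counter


def purity(values):
    """Fraction of items sharing the most common value (1.0 means perfectly pure)."""
    if not values:
        return 0.0
    return Counter(values).most_common(1)[0][1] / len(values)


def landmark_like_clusters(clusters, gps_min=0.8, user_max=0.3):
    """Return ids of clusters pure for GPS cells but impure for user IDs.

    `clusters` maps a cluster id to a list of image records; each record is a
    dict with hypothetical keys "gps_cell" and "user". Thresholds are arbitrary.
    """
    selected = []
    for cluster_id, images in clusters.items():
        gps_purity = purity([image["gps_cell"] for image in images])
        user_purity = purity([image["user"] for image in images])
        if gps_purity >= gps_min and user_purity <= user_max:
            selected.append(cluster_id)
    return selected
```

A cluster whose images all fall in one GPS cell but come from many different users passes both tests, which matches the tourist-landmark pattern described above.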


6. Conclusion 

In this paper, we present an unsupervised approach
specifically designed to deal with large amounts of
non-curated data. Our method is well-suited for distributed
training, which allows training on large datasets with 96M
images. With this amount of data, our approach surpasses
unsupervised methods trained on curated datasets,
which validates the potential of unsupervised learning in
applications where annotations are scarce or curation is not
trivial. Finally, we show that unsupervised pre-training
improves the performance of a network trained on ImageNet.

Figure 6: Comparison of the hashtag distribution in
YFCC100M with the label distribution in ImageNet.

Appendix 

1. Evaluating unsupervised features 

Here we provide numbers from Figure 2 in Table 5. 

2. YFCC100M and ImageNet label distribution

The YFCC100M dataset contains social media from the
Flickr website. The content of this dataset is very unbalanced,
with a “long-tail” distribution of hashtags contrasting
with the well-behaved label distribution of ImageNet, as
can be seen in Figure 6. For example, guenon and baseball
correspond to labels with 1300 associated images in ImageNet,
while there are respectively 226 and 256,758 images
associated with these hashtags in YFCC100M.

3. Pre-training for ImageNet 

In Table 6, we compare the performance of a network
trained with supervision on ImageNet with a standard
initialization (“Supervised”) to one pre-trained with
DeeperCluster (“Supervised + DeeperCluster pre-training”) and to
one pre-trained with RotNet (“Supervised + RotNet
pre-training”). The convnet is finetuned on ImageNet with
supervision with mini-batch SGD following the hyperparameters
of the ImageNet classification example implementation
from the PyTorch documentation (footnotes 5 and 6).
Specifically, we train for
90 epochs (instead of 100 epochs in Table 3 of the main
paper). We use a learning rate of 0.1, a weight decay of
0.0001, a batch size of 256 and dropout of 0.5. We reduce
the learning rate by a factor of 0.1 at epochs 30 and 60
(instead of decaying the learning rate with a factor 0.2 every
20 epochs in Table 3 of the main paper). This setting is
unfair towards the supervised from-scratch baseline: since
we start the optimization with a good initialization, we
arrive at convergence earlier. Indeed, we observe that the gap
between our pretraining and the baseline shrinks from 1.0
to 0.8 when evaluating at convergence instead of
evaluating before convergence. In contrast, the gap between the
RotNet pretraining and the baseline remains the same: 0.4.

5 pytorch.org/docs/stable/torchvision/models

6 github.com/pytorch/examples/blob/master/imagenet/main.py
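
The two learning-rate schedules and the reported gaps can be made concrete with a short sketch. The accuracies are copied from Table 6 and the schedule values from the paragraph above. The helper names are hypothetical, and the base learning rate of the "factor 0.2 every 20 epochs" schedule is left as a parameter because this section does not state it.

```python
def step_lr(epoch, base_lr, factor, milestones):
    """Learning rate after multiplying by `factor` at every milestone reached."""
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * factor ** passed


def pytorch_example_lr(epoch):
    """90-epoch schedule: lr 0.1, multiplied by 0.1 at epochs 30 and 60."""
    return step_lr(epoch, base_lr=0.1, factor=0.1, milestones=(30, 60))


def every_20_epochs_lr(epoch, base_lr, total_epochs=100):
    """100-epoch schedule: multiplied by 0.2 every 20 epochs (base_lr not given)."""
    return step_lr(epoch, base_lr, factor=0.2, milestones=range(20, total_epochs, 20))


# Top-1 accuracies from Table 6 (validation set of ImageNet).
TOP1 = {
    "pytorch_doc": {"scratch": 73.3, "rotnet": 73.7, "deepercluster": 74.3},
    "ours": {"scratch": 74.1, "rotnet": 74.5, "deepercluster": 74.9},
}


def gap(setting, method):
    """Accuracy gain of a pre-trained network over training from scratch."""
    return round(TOP1[setting][method] - TOP1[setting]["scratch"], 1)
```

Calling `gap("pytorch_doc", "deepercluster")` and `gap("ours", "deepercluster")` reproduces the 1.0 and 0.8 gaps quoted above, while both RotNet gaps evaluate to 0.4.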

4. Model analysis 
4.1. Instance retrieval 

Instance retrieval consists of retrieving from a corpus the
images most similar to a given query. We follow the experimental
setting of Tolias et al. [45]: we apply R-MAC with
a resolution of 1024 pixels and 3 grid levels and we report
mAP on instance-level image retrieval on Oxford Build¬ 
ings [3^ ] and Paris [39] datasets. 

As described by Dosovitskiy et al. [13], class-level
supervision induces invariance to semantic categories. This
property may not be beneficial for other computer vision
tasks such as instance-level recognition. For that reason,
descriptor matching and instance retrieval are tasks for which
unsupervised feature learning might provide performance
improvements. Moreover, these tasks constitute evaluations
that do not require any additional training step, allowing a
straightforward comparison across different methods. We
evaluate our method and compare it to previous work
following the experimental setup proposed by Caron et al. [6].
We report results for the instance retrieval task in Table 7.
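
As a reminder of how the reported metric is built, here is a simplified sketch of mean average precision over a set of queries. It is not the official Oxford or Paris evaluation script: in particular it ignores the "junk" images that the official protocol excludes from the ranking, and the function and argument names are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list given the set of relevant items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Mean of the per-query average precisions.

    `rankings` maps a query id to its ranked list of image ids;
    `ground_truth` maps the same query id to its relevant image ids.
    """
    scores = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return sum(scores) / len(scores) if scores else 0.0
```

Because no training is involved, the ranking for each query comes directly from similarities between the frozen descriptors, which is what makes the comparison across methods straightforward.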

We observe that features trained with RotNet have
significantly worse performance than DeepCluster on both
Oxford5K and Paris6K. This performance discrepancy
suggests that properties acquired by classifying large rotations
are not relevant to instance retrieval. An explanation is that
all images in Oxford5K and Paris6K have the same orientation,
as they picture buildings and landmarks. As our method
is a combination of the two paradigms, it suffers a
notable performance loss on Oxford5K, but is not affected
much on Paris6K. These results emphasize the importance
of having a diverse set of benchmarks to evaluate the quality
of features produced by unsupervised learning methods.



Figure 7: Sorted standard deviations of clusters to their mean
colors. If the standard deviation of a cluster to its mean color
is low, the images of this cluster have a similar colorization.

















Method        | conv1 | conv2 | conv3 | conv4 | conv5 | conv6 | conv7 | conv8 | conv9 | conv10 | conv11 | conv12 | conv13

ImageNet
Supervised    |  7.8  | 12.3  | 15.6  | 21.4  | 24.4  | 24.1  | 33.4  | 41.1  | 44.7  |  49.6  |  61.2  |  66.0  |  70.2
RotNet        | 10.9  | 15.7  | 17.2  | 21.0  | 27.0  | 26.6  | 26.7  | 33.5  | 35.2  |  33.5  |  39.6  |  38.2  |  33.0
DeeperCluster |  7.4  |  9.6  | 14.9  | 16.8  | 26.1  | 29.2  | 34.2  | 41.6  | 43.4  |  45.5  |  49.0  |  49.2  |  45.6

Places205
Supervised    | 10.5  | 16.4  | 20.7  | 24.7  | 30.3  | 31.3  | 35.0  | 38.1  | 39.5  |  40.8  |  45.4  |  45.3  |  45.9
RotNet        | 13.9  | 19.1  | 22.5  | 24.8  | 29.9  | 30.8  | 32.5  | 35.3  | 36.0  |  36.1  |  38.8  |  37.9  |  35.5
DeeperCluster | 12.7  | 14.8  | 21.2  | 23.3  | 30.5  | 32.6  | 34.8  | 39.5  | 40.8  |  41.6  |  44.0  |  44.0  |  42.1


Table 5: Accuracy of linear classifiers on ImageNet and Places205 using the activations from different layers as features. 
We train a linear classifier on top of frozen convolutional layers at different depths. We compare a VGG-16 trained with 
supervision on ImageNet to VGG-16s trained with either RotNet or our approach on YFCC100M. 



Method                                         | PyTorch doc hyperparam | Our hyperparam
Supervised (PyTorch documentation, footnote 5) |          73.4          |       -
Supervised (our code)                          |          73.3          |      74.1
Supervised + RotNet pre-training               |          73.7          |      74.5
Supervised + DeeperCluster pre-training        |          74.3          |      74.9


Table 6: Top-1 accuracy on the validation set of a VGG-16
trained on ImageNet with supervision with different
initializations. We compare a network initialized randomly to
networks pre-trained with our unsupervised method or with
RotNet on YFCC100M.


Method              | Pretraining | Oxford5K | Paris6K
ImageNet labels     | ImageNet    |   72.4   |  81.5
Random              | -           |    6.9   |  22.0
Doersch et al. [12] | ImageNet    |   35.4   |  53.1
Wang et al. [47]    | Youtube 9M  |   42.3   |  58.0
RotNet              | ImageNet    |   48.2   |  61.1
DeepCluster         | ImageNet    |   61.1   |  74.9
RotNet              | YFCC100M    |   46.5   |  59.2
DeepCluster         | YFCC100M    |   57.2   |  74.6
DeeperCluster       | YFCC100M    |   55.8   |  73.4


Table 7: mAP on instance-level image retrieval on the Oxford
and Paris datasets. We apply R-MAC with a resolution of
1024 pixels and 3 grid levels [45]. We separate the
methods trained without labels on ImageNet from the methods
trained on non-curated datasets. DeepCluster does not scale to the
full YFCC100M dataset; we thus train it on a random subset
of 1.3M images.


Method        | Data     | RGB  | Sobel
RotNet        | YFCC 1M  | 69.8 | 70.4
DeeperCluster | YFCC 20M | 71.6 | 76.1


Table 8: Influence of applying a Sobel filter or using raw
RGB input on the feature quality. We report validation
mAP on Pascal VOC classification task (fc68 setting).


4.2. Influence of data pre-processing 

In this section we experiment with our method on raw
RGB inputs. We provide some insights into the reasons why
Sobel filtering is crucial to obtain good performance with
our method.

First, in Figure 7, we randomly select a subset of 3000
clusters and sort them by standard deviation to their mean
color. If the standard deviation of a cluster to its mean color
is low, it means that the images of this cluster tend to have a
similar colorization. Moreover, we show in Figure 8 some
clusters with a low standard deviation to the mean color. We
observe in Figure 7 that the clustering on features learned
with our method focuses more on color than the clustering
on RotNet features. Indeed, clustering by color and low-level
information produces balanced clusters that can easily
be predicted by a convnet. Clustering by color is therefore a
valid solution to our formulation. However, as we want to avoid
an uninformative clustering essentially based on colors, we remove
some part of the input information by feeding the network
with the image gradients instead of the raw RGB image (see
Figure 9). This greatly improves the performance
of our features when evaluated on downstream tasks, as
can be seen in Table 8. We observe that the Sobel filter
slightly improves RotNet features as well.
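
The effect of replacing RGB input by image gradients can be illustrated with a small pure-Python Sobel filter. This is a sketch under assumptions: the conversion to grayscale by averaging the three channels and the absence of padding are choices made for brevity, and the function names are hypothetical rather than taken from the released code.

```python
def to_grayscale(rgb_image):
    """Average the channels of an image given as rows of (r, g, b) tuples (assumption)."""
    return [[sum(pixel) / 3.0 for pixel in row] for row in rgb_image]


# Standard 3x3 Sobel kernels: SOBEL_X responds to intensity changes along the
# horizontal axis, SOBEL_Y to changes along the vertical axis.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_gradients(gray):
    """Return the two gradient channels of a grayscale image (no padding)."""
    height, width = len(gray), len(gray[0])
    grad_x = [[0.0] * (width - 2) for _ in range(height - 2)]
    grad_y = [[0.0] * (width - 2) for _ in range(height - 2)]
    for i in range(1, height - 1):
        for j in range(1, width - 1):
            sx = sy = 0.0
            for di in range(3):
                for dj in range(3):
                    value = gray[i + di - 1][j + dj - 1]
                    sx += SOBEL_X[di][dj] * value
                    sy += SOBEL_Y[di][dj] * value
            grad_x[i - 1][j - 1] = sx
            grad_y[i - 1][j - 1] = sy
    return grad_x, grad_y
```

A uniformly red image and a uniformly blue image produce the same all-zero gradient channels, so a clustering computed on this input can no longer group images by their dominant color alone.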























Figure 8: We show clusters with a uniform colorization
across their images. For each cluster, we show the mean
color of the cluster.


[Figure 9 panels: RGB, Sobel]

Figure 9: Visualization of two images preprocessed with
a Sobel filter. The Sobel filter gives a 2-channel output which,
at each point, contains the vertical and horizontal derivative
approximations. Photographer usernames of these two YFCC100M
RGB images are respectively booledozer and nathalie.cone.


5. Hyperparameters 

In this section, we detail our different hyperparameter 
choices. Images are rescaled to 3 x 224 x 224. Note that for 
each network we choose the best performing hyperparam¬ 
eters by evaluating on Pascal VOC 2007 classification task 
without finetuning. 

• RotNet YFCC100M: we train with a total batch-size
of 512, a learning rate of 0.05, weight decay of
0.00001 and dropout of 0.3.

• RotNet ImageNet: we train with a total batch-size of
512, a learning rate of 0.05, weight decay of 0.00001
and dropout of 0.3.

• DeepCluster YFCC100M 1.3M images: we train
with a total batch-size of 256, a learning rate of 0.05,
weight decay of 0.00001 and dropout of 0.5. A Sobel
filter is used in the preprocessing step. We cluster the
features, PCA-reduced to 256 dimensions, whitened and
normalized, with k-means into 10,000 clusters every
2 epochs of training.

• DeeperCluster YFCC100M: we train with a total
batch-size of 3072, a learning rate of 0.1, weight decay
of 0.00001 and dropout of 0.5. A Sobel filter is used
in the preprocessing step. We cluster the whitened and
normalized features (of dimension 4096) of the
non-rotated images with hierarchical k-means into 320,000
clusters (4 clusterings of 80,000 clusters each) every 3
epochs of training.

• DeeperCluster ImageNet: we train with a total
batch-size of 748, a learning rate of 0.1, weight decay of
0.00001 and dropout of 0.5. A Sobel filter is used
in the preprocessing step. We cluster the whitened and
normalized features (of dimension 4096) of the
non-rotated images with k-means into 10,000 clusters
every 5 epochs of training.

For all methods, we use stochastic gradient descent with a 
momentum of 0.9. We stop training as soon as performance 
on Pascal VOC 2007 classification task saturates. We use 
PyTorch version 1.0 for all our experiments. 
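
For convenience, the settings listed in this section can be gathered in a small configuration table. The sketch below only restates the values given above. The `TrainConfig` class, its field names and the dictionary keys are hypothetical, and `sobel=False` for RotNet merely reflects that the list does not mention a Sobel filter for those runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    batch_size: int
    lr: float
    weight_decay: float
    dropout: float
    momentum: float = 0.9  # SGD momentum shared by all methods
    sobel: bool = False  # not mentioned for RotNet in the list above
    num_clusters: Optional[int] = None
    recluster_every: Optional[int] = None  # in epochs


CONFIGS = {
    "rotnet_yfcc100m": TrainConfig(512, 0.05, 1e-5, 0.3),
    "rotnet_imagenet": TrainConfig(512, 0.05, 1e-5, 0.3),
    "deepcluster_yfcc_1.3m": TrainConfig(
        256, 0.05, 1e-5, 0.5, sobel=True, num_clusters=10_000, recluster_every=2
    ),
    "deepercluster_yfcc100m": TrainConfig(
        # 4 clusterings of 80,000 clusters each
        3072, 0.1, 1e-5, 0.5, sobel=True, num_clusters=4 * 80_000, recluster_every=3
    ),
    "deepercluster_imagenet": TrainConfig(
        748, 0.1, 1e-5, 0.5, sobel=True, num_clusters=10_000, recluster_every=5
    ),
}
```

Keeping the values in one immutable structure makes it easy to check, for example, that the DeeperCluster run on YFCC100M uses 320,000 clusters in total.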

6. Usernames of cluster visualization images 

For copyright reasons, we give here the Flickr user names
of the images from Figure 5. For each cluster, the user
names are listed from left to right and from top to bottom.
Photographers of images in cluster cat are sun_summer,
savasavasava, windy_sydney, ironsalchicha, Chiang Kai
Yen, habigu, Crackers93, rikkis_refuge and rabidgamer.
Photographers of images in cluster elephantparadelondon
are Karen Roe, asw909, Matt From London, jorgeleria, Loz
Flowers, Loz Flowers, Deck Accessory, Maxwell Hamilton
and Melinda 26 Cristiano. Photographers of images
in cluster always are troutproject, elandru, vlauria, Raymond
Yee, tsupo543, masatsu, robotson, edgoubert and
troutproject. Photographers of images in cluster CanoScan
are what-i-found, what-i-found, allthepreciousthings, carbonated,
what-i-found, what-i-found, what-i-found, what-i-found
and what-i-found. Photographers of images in cluster
GPS: (43, 10) are bloke, garysoccerl, macpalm, MATT
E O 1 2 3, coderll, Johan.dk, chrissmallwood, markomni
and xiquinhosilva. Photographers of images in cluster
GPS: (-34, -151) are asamiToku, Scott R Frost, BeauGiles,
ME ADEN, chaitanyakuber, mathias Straumann, jeroenvan-
lieshout, jamespia and Bastard Sheep. Photographers of
images in cluster GPS: (64, -20) are arrygj, Bsivad, Powys
Walker, Maria Grazia Dal Pra27, Sterling College, round-
edby gravity, johnmcga, Muddy Ravine and El coleccionista
de instantes. Photographers of images in cluster GPS: (43, -104)
are dodds, eric.terry.kc, Lodahln, wmamurphy, purza7,
jfhatesmustard, Marcel B., Silly America and Liralen Li.



























